Introduction

1. Overview

This tutorial demonstrates how to use R to perform data analytics tasks covered in Week 02. We’ll explore:

  • Descriptive Analytics: Understanding what happened
  • Predictive Analytics: Predicting what could happen
  • Prescriptive Analytics: Determining what should happen

2. Learning Objectives

By the end of this tutorial, you will be able to:

  1. Load and explore datasets in R
  2. Calculate descriptive statistics (central tendency, dispersion)
  3. Create frequency distributions and visualizations
  4. Understand different variable types and measurements
  5. Build simple predictive models
  6. Apply the research process using R

3. Required Packages

# Install packages if needed (uncomment to install)
# install.packages(c("tidyverse", "ggplot2", "dplyr", "knitr", "kableExtra"))
# install.packages("seedhash")  # For reproducible seed generation

# Load required packages
library(tidyverse)    # Data manipulation and visualization
library(ggplot2)      # Advanced plotting
library(dplyr)        # Data manipulation
library(knitr)        # Table formatting
library(kableExtra)   # Enhanced tables
library(seedhash)     # Reproducible seed generation

# NOTE: `gen` and `seeds` below are assumed to have been created with
# seedhash beforehand, e.g. (hypothetical calls -- check the package docs
# for the exact API):
#   gen   <- seedhash::SeedGenerator$new("A dog is a man's best friend")
#   seeds <- gen$generate_seeds(100)

# Display seed information
cat("\n=== REPRODUCIBLE SEED INFORMATION ===")
## 
## === REPRODUCIBLE SEED INFORMATION ===
cat("\nGenerator Name: A dog is a man's best friend")
## 
## Generator Name: A dog is a man's best friend
cat("\nMD5 Hash:", gen$get_hash())
## 
## MD5 Hash: 0e112cb70d324ea5e5b9f803b4488bde
cat("\nAvailable Seeds:", paste(seeds[1:5], collapse = ", "), "...\n\n")
## 
## Available Seeds: -805293604, -909845485, -339394968, 586956699, -960637137 ...

Part 1: Understanding Data Types

1.1 Variable Types in R

R handles different types of variables. Let’s explore them:

Key term – measurement level: Field, Miles, and Field (2012, ch. 2) emphasize that correctly classifying each variable’s measurement level determines which descriptive summaries and hypothesis tests are valid. Treating an ordinal scale (for example, satisfaction ratings) as if it were interval can distort estimates because the distance between categories is not guaranteed to be equal (Field et al., 2012).
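A quick sketch (toy data, not from the tutorial's datasets) makes the point concrete: the mean of an ordinal variable shifts with the arbitrary numeric coding we choose, while the median rank does not.

```r
# Two equally valid numeric codings of the same ordinal scale:
# the order is preserved in both, only the (arbitrary) spacing changes.
responses <- c("Low", "Medium", "Medium", "High", "High")

coding_a <- c(Low = 1, Medium = 2, High = 3)    # assumes equal spacing
coding_b <- c(Low = 1, Medium = 2, High = 10)   # same order, wider top step

mean(coding_a[responses])    # 2.2
mean(coding_b[responses])    # 5.0 -- the "average" moved with the coding
median(coding_a[responses])  # 2  -> "Medium"
median(coding_b[responses])  # 2  -> "Medium" (rank-based, coding-invariant)
```

Because the mean depends on the spacing we impose, it is only trustworthy when that spacing is genuinely equal (interval or ratio data).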

# Numeric (continuous) variables
age <- 25
height <- 175.5
temperature <- 98.6

# Integer variables
num_students <- 30L
year <- 2025L

# Character (string) variables
name <- "John Doe"
city <- "Washington DC"

# Logical (boolean) variables
is_student <- TRUE
passed_exam <- FALSE

# Factor (categorical) variables
education_level <- factor(c("High School", "Bachelor", "Master", "PhD"))
grade <- factor(c("A", "B", "C", "D", "F"), ordered = TRUE)

# Check variable types
cat("Type of age:", class(age), "\n")
## Type of age: numeric
cat("Type of name:", class(name), "\n")
## Type of name: character
cat("Type of is_student:", class(is_student), "\n")
## Type of is_student: logical
cat("Type of education_level:", class(education_level), "\n")
## Type of education_level: factor

1.2 Categorical Variables

Categorical variables capture group membership. Field et al. (2012, ch. 2) note that we summarize them with counts or proportions and we model them with tests designed for frequencies (such as the chi-square test introduced later in Part 5).

1.2.1 Binary Variables

Binary (or dichotomous) variables only have two possible outcomes (Field et al., 2012). Because every case must fall into exactly one category, proportions are a natural summary and form the foundation for later logistic models.

# Binary variable example: Pass/Fail
pass_fail <- c("Pass", "Fail", "Pass", "Pass", "Fail")
pass_fail_factor <- factor(pass_fail)

cat("Binary variable levels:", levels(pass_fail_factor), "\n")
## Binary variable levels: Fail Pass
table(pass_fail_factor)
## pass_fail_factor
## Fail Pass 
##    2    3

1.2.2 Nominal Variables

Nominal scales label unordered categories such as department names. Field et al. (2012) highlight that although the numbers we assign to these categories are arbitrary, the counts still convey how often each outcome appears.

# Nominal: Categories without order
departments <- factor(c("HR", "IT", "Finance", "Marketing", "IT", "HR", "Finance"))

cat("Department frequency:\n")
## Department frequency:
table(departments)
## departments
##   Finance        HR        IT Marketing 
##         2         2         2         1
# Visualize
barplot(table(departments), 
        main = "Department Distribution",
        col = "steelblue",
        ylab = "Frequency",
        las = 1)

1.2.3 Ordinal Variables

Ordinal variables preserve order but not equal spacing between categories (Field et al., 2012). Treat medians or percentiles as your primary summaries because means can hide the unequal steps.

# Ordinal: Categories with logical order
satisfaction <- factor(
  c("Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied",
    "Satisfied", "Neutral", "Very Satisfied", "Satisfied", "Unsatisfied"),
  levels = c("Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied"),
  ordered = TRUE
)

cat("Satisfaction levels:\n")
## Satisfaction levels:
print(table(satisfaction))
## satisfaction
## Very Unsatisfied      Unsatisfied          Neutral        Satisfied 
##                1                2                2                3 
##   Very Satisfied 
##                2
# Visualize with ordered categories
barplot(table(satisfaction),
        main = "Customer Satisfaction Survey",
        col = rainbow(5),
        ylab = "Count",
        las = 2,
        cex.names = 0.8)

1.3 Continuous Variables

Continuous variables have meaningful numeric scales, allowing arithmetic operations. Field et al. (2012) distinguish between interval scales (equal steps without a true zero) and ratio scales (equal steps with a meaningful zero), which guides whether ratios like “twice as much” are interpretable.

1.3.1 Interval Variables

# Interval: Equal intervals but no true zero (e.g., temperature in Celsius)
temperatures <- c(15, 18, 22, 25, 20, 16, 19, 23, 21, 17)

cat("Temperature Statistics:\n")
## Temperature Statistics:
cat("Mean:", mean(temperatures), "°C\n")
## Mean: 19.6 °C
cat("Median:", median(temperatures), "°C\n")
## Median: 19.5 °C
cat("Range:", range(temperatures)[1], "to", range(temperatures)[2], "°C\n")
## Range: 15 to 25 °C
# Note: 0°C doesn't mean "no temperature"
hist(temperatures,
     main = "Temperature Distribution",
     xlab = "Temperature (°C)",
     col = "lightblue",
     border = "white")

1.3.2 Ratio Variables

# Ratio: Equal intervals AND true zero (e.g., height, weight, income)
heights <- c(160, 175, 182, 168, 155, 190, 172, 165, 178, 185)

cat("Height Statistics:\n")
## Height Statistics:
cat("Mean:", mean(heights), "cm\n")
## Mean: 173 cm
cat("Median:", median(heights), "cm\n")
## Median: 173.5 cm
cat("Note: 0 cm would mean no height (true zero)\n")
## Note: 0 cm would mean no height (true zero)
# Ratio calculations make sense
cat("\nRatio example: a person 180cm tall is", 180/90, "times as tall as one 90cm tall\n")
## 
## Ratio example: a person 180cm tall is 2 times as tall as one 90cm tall
hist(heights,
     main = "Height Distribution",
     xlab = "Height (cm)",
     col = "coral",
     border = "white")


Part 2: Descriptive Analytics

2.1 Loading Data

Let’s work with built-in R datasets and real examples:

# Load built-in datasets
data(iris)        # Famous flower dataset
data(mtcars)      # Motor cars data
data(airquality)  # Air quality measurements

# Preview datasets
cat("Iris dataset (first 6 rows):\n")
## Iris dataset (first 6 rows):
head(iris) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover"))
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         5.1          3.5           1.4          0.2  setosa
         4.9          3.0           1.4          0.2  setosa
         4.7          3.2           1.3          0.2  setosa
         4.6          3.1           1.5          0.2  setosa
         5.0          3.6           1.4          0.2  setosa
         5.4          3.9           1.7          0.4  setosa

2.2 Central Tendency Measures

2.2.1 The Mean

Field et al. (2012, ch. 4) describe the mean as the balance point of a distribution: the point at which the total distance of scores above equals the total distance below. Because it uses every score, the mean is efficient but sensitive to outliers, so always pair it with a visualization to check for extreme values.

# Calculate mean
sepal_lengths <- iris$Sepal.Length

mean_value <- mean(sepal_lengths)
cat("Mean Sepal Length:", round(mean_value, 2), "cm\n")
## Mean Sepal Length: 5.84 cm
# Visualize with mean line
hist(sepal_lengths,
     main = "Sepal Length Distribution with Mean",
     xlab = "Sepal Length (cm)",
     col = "lightgreen",
     border = "white")
abline(v = mean_value, col = "red", lwd = 3, lty = 2)
legend("topright", legend = paste("Mean =", round(mean_value, 2)),
       col = "red", lty = 2, lwd = 3)

Understanding the Mean:

\[\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + ... + x_n}{n}\]

# Manual calculation to understand the formula
sum_values <- sum(sepal_lengths)
n_values <- length(sepal_lengths)
manual_mean <- sum_values / n_values

cat("Sum of values:", sum_values, "\n")
## Sum of values: 876.5
cat("Number of values:", n_values, "\n")
## Number of values: 150
cat("Manual mean calculation:", round(manual_mean, 2), "\n")
## Manual mean calculation: 5.84
cat("R's mean() function:", round(mean(sepal_lengths), 2), "\n")
## R's mean() function: 5.84

2.2.2 The Median

The median splits the ordered data in half. Field et al. (2012) note that because it is based on rank rather than magnitude, the median resists the influence of extreme scores and is a better summary than the mean for skewed or ordinal data.

# Calculate median
median_value <- median(sepal_lengths)

cat("Median Sepal Length:", round(median_value, 2), "cm\n")
## Median Sepal Length: 5.8 cm
cat("Mean Sepal Length:", round(mean_value, 2), "cm\n")
## Mean Sepal Length: 5.84 cm
cat("Difference:", round(abs(mean_value - median_value), 3), "cm\n")
## Difference: 0.043 cm
# Visualize both
hist(sepal_lengths,
     main = "Sepal Length: Mean vs Median",
     xlab = "Sepal Length (cm)",
     col = "skyblue",
     border = "white")
abline(v = mean_value, col = "red", lwd = 3, lty = 2)
abline(v = median_value, col = "blue", lwd = 3, lty = 2)
legend("topright", 
       legend = c(paste("Mean =", round(mean_value, 2)),
                  paste("Median =", round(median_value, 2))),
       col = c("red", "blue"), 
       lty = 2, 
       lwd = 3)

Finding the Median:

For an odd number of values: the middle value.
For an even number of values: the average of the two middle values.

# Example with manual calculation
facebook_friends <- c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)
sorted_friends <- sort(facebook_friends)

cat("Sorted values:", sorted_friends, "\n")
## Sorted values: 22 40 53 57 93 98 103 108 116 121 252
cat("Position of median: (n+1)/2 = (11+1)/2 =", (length(sorted_friends)+1)/2, "\n")
## Position of median: (n+1)/2 = (11+1)/2 = 6
cat("Median value:", median(facebook_friends), "\n")
## Median value: 98
# Remove extreme value
friends_no_extreme <- facebook_friends[facebook_friends != 252]
cat("\nWithout extreme value (252):\n")
## 
## Without extreme value (252):
cat("New median:", median(friends_no_extreme), "\n")
## New median: 95.5
cat("Change:", median(facebook_friends) - median(friends_no_extreme), "\n")
## Change: 2.5
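The worked example above has an odd number of values; a short companion sketch (reusing the ten friend counts left after dropping the extreme score of 252) shows the even-n rule in action.

```r
# With an even number of values, the median averages the two middle
# scores -- here the 5th and 6th of 10.
friends_even <- c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121)
sorted_even <- sort(friends_even)
middle_two <- sorted_even[c(5, 6)]

cat("Middle two values:", middle_two, "\n")    # 93 98
cat("Their average:", mean(middle_two), "\n")  # 95.5
cat("median():", median(friends_even), "\n")   # 95.5 -- matches
```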

2.2.3 The Mode

The mode identifies the most frequently occurring category or value. Field et al. (2012) recommend reporting it alongside the median for ordinal data because it highlights the most common response option.

# R doesn't have a built-in mode function, so we create one
get_mode <- function(x) {
  unique_vals <- unique(x)
  freq_table <- tabulate(match(x, unique_vals))
  unique_vals[which.max(freq_table)]
}

# Example with discrete data
grades <- c("A", "B", "B", "C", "B", "A", "C", "B", "D", "B", "C")
mode_grade <- get_mode(grades)

cat("Mode (most frequent grade):", mode_grade, "\n")
## Mode (most frequent grade): B
cat("\nFrequency table:\n")
## 
## Frequency table:
print(table(grades))
## grades
## A B C D 
## 2 5 3 1
# Visualize
barplot(table(grades),
        main = "Grade Distribution (Mode in Red)",
        col = ifelse(names(table(grades)) == mode_grade, "red", "steelblue"),
        ylab = "Frequency")

2.2.4 Comparing Central Tendency Measures

# Create a comparison table
iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    Mean = mean(Sepal.Length),
    Median = median(Sepal.Length),
    Min = min(Sepal.Length),
    Max = max(Sepal.Length),
    N = n()
  ) %>%
  mutate(across(where(is.numeric), ~round(., 2)))

iris_summary %>%
  kable(caption = "Central Tendency by Species") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Central Tendency by Species
Species     Mean  Median  Min  Max   N
setosa      5.01     5.0  4.3  5.8  50
versicolor  5.94     5.9  4.9  7.0  50
virginica   6.59     6.5  4.9  7.9  50

2.3 Measures of Dispersion

2.3.1 Range

The range communicates the full spread of the data (max − min). Field et al. (2012) point out that it is easy to compute yet heavily influenced by a single extreme value, so treat it as a quick diagnostic rather than a robust statistic.

# Calculate range
range_vals <- range(sepal_lengths)
range_value <- diff(range_vals)

cat("Minimum:", range_vals[1], "\n")
## Minimum: 4.3
cat("Maximum:", range_vals[2], "\n")
## Maximum: 7.9
cat("Range:", range_value, "\n")
## Range: 3.6
# Effect of extreme values
cat("\nEffect of extreme values:\n")
## 
## Effect of extreme values:
cat("Original range:", diff(range(facebook_friends)), "\n")
## Original range: 230
cat("Without extreme (252):", diff(range(friends_no_extreme)), "\n")
## Without extreme (252): 99
cat("Reduction:", diff(range(facebook_friends)) - diff(range(friends_no_extreme)), "\n")
## Reduction: 131

2.3.2 Interquartile Range (IQR)

Field et al. (2012) advocate using the interquartile range when you want the spread of the middle 50% of scores. Because it ignores the most extreme quartiles, the IQR pairs naturally with the median for skewed distributions.

# Calculate quartiles
q1 <- quantile(sepal_lengths, 0.25)
q2 <- quantile(sepal_lengths, 0.50)  # This is the median
q3 <- quantile(sepal_lengths, 0.75)
iqr_value <- IQR(sepal_lengths)

cat("Lower Quartile (Q1):", q1, "\n")
## Lower Quartile (Q1): 5.1
cat("Median (Q2):", q2, "\n")
## Median (Q2): 5.8
cat("Upper Quartile (Q3):", q3, "\n")
## Upper Quartile (Q3): 6.4
cat("Interquartile Range (IQR):", iqr_value, "\n")
## Interquartile Range (IQR): 1.3
# Boxplot shows quartiles
boxplot(sepal_lengths,
        main = "Boxplot Showing Quartiles",
        ylab = "Sepal Length (cm)",
        col = "lightblue",
        horizontal = TRUE)
text(x = q1, y = 1.3, labels = "Q1", col = "red", cex = 1.2)
text(x = q2, y = 1.3, labels = "Q2 (Median)", col = "red", cex = 1.2)
text(x = q3, y = 1.3, labels = "Q3", col = "red", cex = 1.2)

2.3.3 Variance and Standard Deviation

Variance averages the squared deviations from the mean, while the standard deviation returns to the original units by taking the square root (Field et al., 2012). These statistics assume an interval or ratio scale and provide the backbone for inferential models, including the t tests in Part 5.

# Calculate variance and standard deviation
var_value <- var(sepal_lengths)
sd_value <- sd(sepal_lengths)

cat("Variance:", round(var_value, 4), "\n")
## Variance: 0.6857
cat("Standard Deviation:", round(sd_value, 4), "\n")
## Standard Deviation: 0.8281
cat("\nInterpretation:\n")
## 
## Interpretation:
cat("On average, sepal lengths deviate by", round(sd_value, 2), "cm from the mean.\n")
## On average, sepal lengths deviate by 0.83 cm from the mean.
# Visualize with standard deviation bands
mean_val <- mean(sepal_lengths)
hist(sepal_lengths,
     main = "Distribution with Standard Deviation",
     xlab = "Sepal Length (cm)",
     col = "lavender",
     border = "white")
abline(v = mean_val, col = "red", lwd = 2)
abline(v = mean_val - sd_value, col = "blue", lwd = 2, lty = 2)
abline(v = mean_val + sd_value, col = "blue", lwd = 2, lty = 2)
legend("topright", 
       legend = c("Mean", "± 1 SD"),
       col = c("red", "blue"), 
       lty = c(1, 2), 
       lwd = 2)

Understanding Variance and Standard Deviation:

\[s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\]

\[s = \sqrt{s^2}\]

# Manual calculation to understand the formula
deviations <- sepal_lengths - mean(sepal_lengths)
squared_deviations <- deviations^2
sum_squared_dev <- sum(squared_deviations)
manual_variance <- sum_squared_dev / (length(sepal_lengths) - 1)
manual_sd <- sqrt(manual_variance)

cat("Manual variance:", round(manual_variance, 4), "\n")
## Manual variance: 0.6857
cat("R's var():", round(var(sepal_lengths), 4), "\n")
## R's var(): 0.6857
cat("Manual SD:", round(manual_sd, 4), "\n")
## Manual SD: 0.8281
cat("R's sd():", round(sd(sepal_lengths), 4), "\n")
## R's sd(): 0.8281

2.3.4 Complete Dispersion Summary

# Comprehensive dispersion measures
dispersion_summary <- data.frame(
  Measure = c("Range", "IQR", "Variance", "Std Dev", "Coef of Variation"),
  Value = c(
    diff(range(sepal_lengths)),
    IQR(sepal_lengths),
    var(sepal_lengths),
    sd(sepal_lengths),
    sd(sepal_lengths) / mean(sepal_lengths) * 100
  ),
  Unit = c("cm", "cm", "cm²", "cm", "%")
)

dispersion_summary %>%
  mutate(Value = round(Value, 3)) %>%
  kable(caption = "Measures of Dispersion for Sepal Length") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Measures of Dispersion for Sepal Length
Measure             Value  Unit
Range               3.600  cm
IQR                 1.300  cm
Variance            0.686  cm²
Std Dev             0.828  cm
Coef of Variation  14.171  %

2.4 Frequency Distributions

# Create frequency table
sepal_length_cut <- cut(sepal_lengths, breaks = 8)
freq_table <- table(sepal_length_cut)

cat("Frequency Distribution:\n")
## Frequency Distribution:
print(freq_table)
## sepal_length_cut
## (4.3,4.75] (4.75,5.2] (5.2,5.65] (5.65,6.1] (6.1,6.55]   (6.55,7]   (7,7.45] 
##         11         34         20         30         25         18          6 
## (7.45,7.9] 
##          6
# Create histogram with frequency
hist(sepal_lengths,
     main = "Frequency Distribution of Sepal Length",
     xlab = "Sepal Length (cm)",
     ylab = "Frequency",
     col = "steelblue",
     border = "white",
     breaks = 12)

# Add density curve
lines(density(sepal_lengths), col = "red", lwd = 3)
legend("topright", legend = "Density Curve", col = "red", lwd = 3)

2.4.1 Distribution Shapes

# Create different distribution shapes
set.seed(123)  # fix the RNG so the simulated shapes are reproducible
par(mfrow = c(2, 2))

# Normal distribution
normal_data <- rnorm(1000, mean = 50, sd = 10)
hist(normal_data, main = "Normal Distribution", 
     col = "lightgreen", border = "white", xlab = "Values")

# Right-skewed (positive skew)
skewed_right <- rexp(1000, rate = 0.5)
hist(skewed_right, main = "Right Skewed (Positive)", 
     col = "lightcoral", border = "white", xlab = "Values")

# Left-skewed (negative skew)
skewed_left <- -rexp(1000, rate = 0.5)
hist(skewed_left, main = "Left Skewed (Negative)", 
     col = "lightblue", border = "white", xlab = "Values")

# Uniform distribution
uniform_data <- runif(1000, min = 0, max = 100)
hist(uniform_data, main = "Uniform Distribution", 
     col = "lavender", border = "white", xlab = "Values")

par(mfrow = c(1, 1))

2.4.2 Describing Distribution Shape

# install.packages("moments")  # uncomment to install if needed
library(moments)  # For skewness and kurtosis

# Calculate shape statistics
skewness_val <- skewness(sepal_lengths)
kurtosis_val <- kurtosis(sepal_lengths)

cat("Skewness:", round(skewness_val, 3), "\n")
## Skewness: 0.312
cat("Interpretation:", 
    ifelse(abs(skewness_val) < 0.5, "Approximately Symmetric",
           ifelse(skewness_val > 0, "Right-skewed (Positive)", "Left-skewed (Negative)")), "\n\n")
## Interpretation: Approximately Symmetric
cat("Kurtosis:", round(kurtosis_val, 3), "\n")
## Kurtosis: 2.426
cat("Interpretation:",
    ifelse(kurtosis_val > 3, "Leptokurtic (heavy tails)",
           ifelse(kurtosis_val < 3, "Platykurtic (light tails)", "Mesokurtic (normal)")), "\n")
## Interpretation: Platykurtic (light tails)

Part 3: Exploratory Data Analysis

3.1 Summary Statistics

# Comprehensive summary statistics
cat("=== IRIS DATASET SUMMARY ===\n\n")
## === IRIS DATASET SUMMARY ===
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
# More detailed summary by species
iris_detailed <- iris %>%
  group_by(Species) %>%
  summarise(
    N = n(),
    Mean_Sepal_Length = mean(Sepal.Length),
    SD_Sepal_Length = sd(Sepal.Length),
    Mean_Sepal_Width = mean(Sepal.Width),
    SD_Sepal_Width = sd(Sepal.Width),
    Mean_Petal_Length = mean(Petal.Length),
    SD_Petal_Length = sd(Petal.Length)
  ) %>%
  mutate(across(where(is.numeric), ~round(., 2)))

iris_detailed %>%
  kable(caption = "Detailed Summary by Species") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  scroll_box(width = "100%")
Detailed Summary by Species
Species      N  Mean_Sepal_Length  SD_Sepal_Length  Mean_Sepal_Width  SD_Sepal_Width  Mean_Petal_Length  SD_Petal_Length
setosa      50               5.01             0.35              3.43            0.38               1.46             0.17
versicolor  50               5.94             0.52              2.77            0.31               4.26             0.47
virginica   50               6.59             0.64              2.97            0.32               5.55             0.55

3.2 Data Visualization

3.2.1 Univariate Analysis

# Multiple ways to visualize single variable
par(mfrow = c(2, 2))

# Histogram
hist(iris$Sepal.Length, 
     main = "Histogram",
     xlab = "Sepal Length (cm)",
     col = "steelblue",
     border = "white")

# Density plot
plot(density(iris$Sepal.Length),
     main = "Density Plot",
     xlab = "Sepal Length (cm)",
     lwd = 2,
     col = "darkblue")
polygon(density(iris$Sepal.Length), col = rgb(0, 0, 1, 0.3))

# Boxplot
boxplot(iris$Sepal.Length,
        main = "Boxplot",
        ylab = "Sepal Length (cm)",
        col = "lightgreen")

# Q-Q plot (check normality)
qqnorm(iris$Sepal.Length, main = "Q-Q Plot")
qqline(iris$Sepal.Length, col = "red", lwd = 2)

par(mfrow = c(1, 1))

3.2.2 Bivariate Analysis

# Relationship between two variables
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Sepal Length vs Sepal Width by Species",
    x = "Sepal Length (cm)",
    y = "Sepal Width (cm)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

3.2.3 Multivariate Analysis

# Pairs plot for multiple variables
pairs(iris[, 1:4],
      main = "Scatterplot Matrix of Iris Dataset",
      pch = 21,
      bg = c("red", "green", "blue")[unclass(iris$Species)])

3.2.4 Advanced ggplot2 Visualizations

# Faceted visualization
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(bins = 20, alpha = 0.7, color = "white") +
  facet_wrap(~Species, ncol = 1) +
  labs(
    title = "Sepal Length Distribution by Species",
    x = "Sepal Length (cm)",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    legend.position = "none"
  )

3.3 Data Quality Checks

# Check for missing values
cat("=== MISSING VALUES CHECK ===\n")
## === MISSING VALUES CHECK ===
missing_summary <- data.frame(
  Variable = names(iris),
  Missing_Count = colSums(is.na(iris)),
  Missing_Percent = round(colSums(is.na(iris)) / nrow(iris) * 100, 2),
  row.names = NULL  # drop row names so variable names are not duplicated
)

missing_summary %>%
  kable(caption = "Missing Values Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Missing Values Summary
Variable      Missing_Count  Missing_Percent
Sepal.Length              0                0
Sepal.Width               0                0
Petal.Length              0                0
Petal.Width               0                0
Species                   0                0
# Check for outliers using IQR method
detect_outliers <- function(x) {
  q1 <- quantile(x, 0.25)
  q3 <- quantile(x, 0.75)
  iqr <- q3 - q1
  lower_bound <- q1 - 1.5 * iqr
  upper_bound <- q3 + 1.5 * iqr
  sum(x < lower_bound | x > upper_bound)
}

cat("\n=== OUTLIER DETECTION ===\n")
## 
## === OUTLIER DETECTION ===
outlier_summary <- iris %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), detect_outliers)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Outlier_Count")

outlier_summary %>%
  kable(caption = "Outlier Count by Variable (IQR Method)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Outlier Count by Variable (IQR Method)
Variable      Outlier_Count
Sepal.Length              0
Sepal.Width               4
Petal.Length              0
Petal.Width               0

Part 4: Predictive Analytics

4.1 Simple Linear Regression

# Build a simple linear regression model
# Predict Sepal Width from Sepal Length
model_simple <- lm(Sepal.Width ~ Sepal.Length, data = iris)

# View model summary
summary(model_simple)
## 
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1095 -0.2454 -0.0167  0.2763  1.3338 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.41895    0.25356   13.48   <2e-16 ***
## Sepal.Length -0.06188    0.04297   -1.44    0.152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4343 on 148 degrees of freedom
## Multiple R-squared:  0.01382,    Adjusted R-squared:  0.007159 
## F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519
# Extract key statistics
r_squared <- summary(model_simple)$r.squared
adj_r_squared <- summary(model_simple)$adj.r.squared
coef_intercept <- coef(model_simple)[1]
coef_slope <- coef(model_simple)[2]

cat("\n=== MODEL INTERPRETATION ===\n")
## 
## === MODEL INTERPRETATION ===
cat("Equation: Sepal.Width =", round(coef_intercept, 3), "+", 
    round(coef_slope, 3), "× Sepal.Length\n")
## Equation: Sepal.Width = 3.419 + -0.062 × Sepal.Length
cat("R-squared:", round(r_squared, 4), "\n")
## R-squared: 0.0138
cat("Adjusted R-squared:", round(adj_r_squared, 4), "\n")
## Adjusted R-squared: 0.0072

4.1.1 Visualize Regression

# Plot with regression line
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point(aes(color = Species), size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "blue", fill = "lightblue") +
  labs(
    title = "Simple Linear Regression: Sepal Width vs Sepal Length",
    subtitle = paste("R² =", round(r_squared, 3)),
    x = "Sepal Length (cm)",
    y = "Sepal Width (cm)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

4.1.2 Model Diagnostics

# Diagnostic plots
par(mfrow = c(2, 2))
plot(model_simple, which = 1:4)

par(mfrow = c(1, 1))

cat("\n=== DIAGNOSTIC INTERPRETATION ===\n")
## 
## === DIAGNOSTIC INTERPRETATION ===
cat("1. Residuals vs Fitted: Check for linearity and homoscedasticity\n")
## 1. Residuals vs Fitted: Check for linearity and homoscedasticity
cat("2. Q-Q Plot: Check if residuals are normally distributed\n")
## 2. Q-Q Plot: Check if residuals are normally distributed
cat("3. Scale-Location: Check homoscedasticity\n")
## 3. Scale-Location: Check homoscedasticity
cat("4. Residuals vs Leverage: Identify influential points\n")
## 4. Residuals vs Leverage: Identify influential points

4.2 Multiple Linear Regression

# Build multiple regression model
# Predict Petal Length from multiple predictors
model_multiple <- lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width, 
                     data = iris)

# View summary
summary(model_multiple)
## 
## Call:
## lm(formula = Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width, 
##     data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.99333 -0.17656 -0.01004  0.18558  1.06909 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.26271    0.29741  -0.883    0.379    
## Sepal.Length  0.72914    0.05832  12.502   <2e-16 ***
## Sepal.Width  -0.64601    0.06850  -9.431   <2e-16 ***
## Petal.Width   1.44679    0.06761  21.399   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.319 on 146 degrees of freedom
## Multiple R-squared:  0.968,  Adjusted R-squared:  0.9674 
## F-statistic:  1473 on 3 and 146 DF,  p-value: < 2.2e-16
# Compare models
cat("\n=== MODEL COMPARISON ===\n")
## 
## === MODEL COMPARISON ===
comparison <- data.frame(
  Model = c("Simple Regression", "Multiple Regression"),
  Predictors = c(1, 3),
  R_squared = c(
    summary(model_simple)$r.squared,
    summary(model_multiple)$r.squared
  ),
  Adj_R_squared = c(
    summary(model_simple)$adj.r.squared,
    summary(model_multiple)$adj.r.squared
  ),
  RMSE = c(
    sqrt(mean(model_simple$residuals^2)),
    sqrt(mean(model_multiple$residuals^2))
  )
)

comparison %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  kable(caption = "Model Performance Comparison") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model Performance Comparison
Model                Predictors  R_squared  Adj_R_squared    RMSE
Simple Regression             1     0.0138         0.0072  0.4314
Multiple Regression           3     0.9680         0.9674  0.3147

4.3 Making Predictions

# Create new data for predictions
new_data <- data.frame(
  Sepal.Length = c(5.0, 6.0, 7.0),
  Sepal.Width = c(3.0, 3.5, 3.2),
  Petal.Width = c(1.5, 2.0, 2.5)
)

# Make predictions
predictions <- predict(model_multiple, newdata = new_data, interval = "prediction")

# Combine with input data
prediction_results <- cbind(new_data, predictions) %>%
  mutate(across(where(is.numeric), ~round(., 2)))

prediction_results %>%
  kable(caption = "Predictions with 95% Prediction Intervals") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Predictions with 95% Prediction Intervals
Sepal.Length  Sepal.Width  Petal.Width   fit   lwr   upr
           5          3.0          1.5  3.62  2.97  4.26
           6          3.5          2.0  4.74  4.10  5.39
           7          3.2          2.5  6.39  5.75  7.03

Part 5: Hypothesis Testing

5.1 Understanding Hypotheses

5.1.1 What is a Hypothesis?

A hypothesis is a prediction from a theory that can be tested through data collection and analysis. It’s a specific, testable statement about the relationship between variables.

Key Characteristics of a Good Hypothesis:

  1. Clear and understandable - Anyone should be able to understand what you’re predicting
  2. Testable - You must be able to collect data to test it
  3. Measurable - The variables must be quantifiable
  4. Falsifiable - It must be possible to prove it wrong

5.1.2 Types of Hypotheses

Types of Hypotheses in Statistical Testing:

  • Alternative Hypothesis (H₁ or Hₐ): states that an effect WILL occur; the research prediction. Example: manual transmission cars have higher MPG than automatic cars.
  • Null Hypothesis (H₀): states that NO effect will occur; the opposite of the alternative hypothesis. Example: manual and automatic cars have the same MPG.
  • Directional Hypothesis (H₁, one-tailed): predicts the DIRECTION of the effect (e.g., “greater than”, “less than”). Example: manual cars have HIGHER MPG than automatic cars.
  • Non-directional Hypothesis (H₁, two-tailed): predicts that an effect exists but NOT its direction (e.g., “different from”). Example: manual and automatic cars have DIFFERENT MPG.
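To make the table concrete, the car example can be written as t-tests on R's built-in mtcars data (a preview of the testing functions used below; in mtcars, `am` codes 0 = automatic, 1 = manual).

```r
# The car example from the hypothesis table, expressed as t-tests.
data(mtcars)

# Non-directional H1 (two-tailed): mean MPG simply DIFFERS between groups
t.test(mpg ~ am, data = mtcars, alternative = "two.sided")

# Directional H1 (one-tailed): manual cars have HIGHER MPG. With the
# formula interface the automatic group (am = 0) comes first, so we test
# whether its mean is LESS than the manual group's mean.
t.test(mpg ~ am, data = mtcars, alternative = "less")
```

Both tests reject H₀ at α = 0.05 for this dataset; note how the one-tailed p-value is half the two-tailed one.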

5.1.3 Why Can’t We “Prove” a Hypothesis?

Important Concept: We can never prove the alternative hypothesis using statistics. We can only:

  • Reject the null hypothesis (providing support for the alternative hypothesis)
  • Fail to reject the null hypothesis (insufficient evidence for the alternative hypothesis)

“We cannot talk about the null hypothesis being true or the experimental hypothesis being true, we can only talk in terms of the probability of obtaining a particular set of data if, hypothetically speaking, the null hypothesis was true.”

— Field, Miles, & Field (2012)

Figure: Hypothesis Testing Logic Flow
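A small simulation (illustrative, with an arbitrary seed rather than one from the seedhash list) shows why a single test can never prove anything: even when H₀ is true by construction, about 5% of tests still reject at α = 0.05.

```r
# Simulate many samples where H0 is TRUE (mu really is 6) and record the
# p-values. False rejections occur at roughly the alpha rate.
set.seed(123)  # illustrative seed for reproducibility

p_values <- replicate(1000, {
  sample_data <- rnorm(30, mean = 6, sd = 1)   # H0 is true by construction
  t.test(sample_data, mu = 6)$p.value
})

mean(p_values < 0.05)  # close to 0.05: the Type I error rate
hist(p_values, main = "p-values when H0 is true", xlab = "p-value")
```

The histogram is roughly flat: under a true null, every p-value is equally likely, so a non-significant result is absence of evidence, not evidence of absence.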

5.2 Hypothesis Testing in R

5.2.1 The p-value

The p-value is the probability of obtaining your observed data (or more extreme) assuming the null hypothesis is true.

Field et al. (2012, ch. 9) stress that this probability is conditional: it tells us how compatible the sample is with H₀ rather than the probability that H₀ itself is true. Always interpret the p-value alongside effect sizes and research context instead of a mechanical “significant/not significant” decision (Field et al., 2012).

Interpretation:

  • Small p-value (typically < 0.05): Data is unlikely under H₀ → Reject H₀
  • Large p-value (≥ 0.05): Data is plausible under H₀ → Fail to reject H₀

Common Significance Levels:

  • α = 0.05 (5%) - Most common in social sciences
  • α = 0.01 (1%) - More stringent
  • α = 0.10 (10%) - More lenient (exploratory research)
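As a concrete check, a two-tailed p-value can be computed directly from a t-statistic with pt(). Using t = -2.3172 with df = 149 (the values reported by the one-sample test that follows), a minimal sketch:

```r
# Two-tailed p-value: probability of a t-statistic at least this extreme
# under H0, i.e. the area in both tails of the t distribution
t_stat <- -2.3172   # from the one-sample t-test on iris sepal length
df <- 149
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
round(p_value, 5)   # matches the 0.02186 reported by t.test()
```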

5.2.2 Example 1: One-Sample t-test

Research Question: Is the average sepal length of iris flowers different from 6 cm?

Hypotheses:

  • H₀: μ = 6 (population mean equals 6)
  • H₁: μ ≠ 6 (population mean is different from 6) [two-tailed]

# One-sample t-test
# Test if mean sepal length differs from 6 cm
test_value <- 6

# Perform the test
result_onesample <- t.test(iris$Sepal.Length, mu = test_value)

cat("=== ONE-SAMPLE T-TEST ===\n\n")
## === ONE-SAMPLE T-TEST ===
cat("Research Question: Is mean sepal length different from", test_value, "cm?\n\n")
## Research Question: Is mean sepal length different from 6 cm?
cat("H₀: μ =", test_value, "\n")
## H₀: μ = 6
cat("H₁: μ ≠", test_value, "\n\n")
## H₁: μ ≠ 6
print(result_onesample)
## 
##  One Sample t-test
## 
## data:  iris$Sepal.Length
## t = -2.3172, df = 149, p-value = 0.02186
## alternative hypothesis: true mean is not equal to 6
## 95 percent confidence interval:
##  5.709732 5.976934
## sample estimates:
## mean of x 
##  5.843333
cat("\n=== INTERPRETATION ===\n")
## 
## === INTERPRETATION ===
cat("Sample mean:", round(mean(iris$Sepal.Length), 3), "cm\n")
## Sample mean: 5.843 cm
cat("t-statistic:", round(result_onesample$statistic, 3), "\n")
## t-statistic: -2.317
cat("p-value:", format.pval(result_onesample$p.value, digits = 4), "\n")
## p-value: 0.02186
cat("95% CI: [", round(result_onesample$conf.int[1], 3), ",", 
    round(result_onesample$conf.int[2], 3), "]\n\n")
## 95% CI: [ 5.71 , 5.977 ]
if(result_onesample$p.value < 0.05) {
  cat("Decision: REJECT H₀ (p < 0.05)\n")
  cat("Conclusion: The mean sepal length is significantly different from", test_value, "cm.\n")
} else {
  cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
  cat("Conclusion: Insufficient evidence that mean differs from", test_value, "cm.\n")
}
## Decision: REJECT H₀ (p < 0.05)
## Conclusion: The mean sepal length is significantly different from 6 cm.
# Visualize
hist(iris$Sepal.Length,
     main = "Sepal Length Distribution with Test Value",
     xlab = "Sepal Length (cm)",
     col = "lightblue",
     border = "white",
     breaks = 15)
abline(v = mean(iris$Sepal.Length), col = "red", lwd = 3, lty = 1)
abline(v = test_value, col = "blue", lwd = 3, lty = 2)
legend("topright", 
       legend = c(paste("Sample Mean =", round(mean(iris$Sepal.Length), 2)),
                  paste("Test Value =", test_value)),
       col = c("red", "blue"), 
       lty = c(1, 2), 
       lwd = 3)
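Note that the confidence interval and the two-tailed test tell the same story: rejecting H₀ at α = 0.05 is equivalent to the 95% CI excluding the test value. A quick sketch:

```r
# The 95% CI [5.71, 5.98] does not contain 6, so H0: mu = 6 is rejected
result <- t.test(iris$Sepal.Length, mu = 6)
ci <- result$conf.int
outside <- ci[1] > 6 | ci[2] < 6   # TRUE when 6 lies outside the interval
outside
```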

5.2.3 Example 2: Independent Two-Sample t-test

Research Question: Do manual transmission cars have different fuel efficiency than automatic transmission cars?

Hypotheses:

  • H₀: μ_manual = μ_automatic (no difference in mean MPG)
  • H₁: μ_manual ≠ μ_automatic (means are different) [two-tailed]

# Independent samples t-test
# Compare MPG between automatic and manual transmission

# Prepare data
automatic <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

cat("=== INDEPENDENT TWO-SAMPLE T-TEST ===\n\n")
## === INDEPENDENT TWO-SAMPLE T-TEST ===
cat("Research Question: Do manual and automatic cars differ in fuel efficiency?\n\n")
## Research Question: Do manual and automatic cars differ in fuel efficiency?
cat("H₀: μ_manual = μ_automatic\n")
## H₀: μ_manual = μ_automatic
cat("H₁: μ_manual ≠ μ_automatic\n\n")
## H₁: μ_manual ≠ μ_automatic
# Perform t-test
result_independent <- t.test(manual, automatic, var.equal = FALSE)

print(result_independent)
## 
##  Welch Two Sample t-test
## 
## data:  manual and automatic
## t = 3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.209684 11.280194
## sample estimates:
## mean of x mean of y 
##  24.39231  17.14737
cat("\n=== DESCRIPTIVE STATISTICS ===\n")
## 
## === DESCRIPTIVE STATISTICS ===
cat("Automatic transmission:\n")
## Automatic transmission:
cat("  n =", length(automatic), "\n")
##   n = 19
cat("  Mean =", round(mean(automatic), 2), "mpg\n")
##   Mean = 17.15 mpg
cat("  SD =", round(sd(automatic), 2), "\n\n")
##   SD = 3.83
cat("Manual transmission:\n")
## Manual transmission:
cat("  n =", length(manual), "\n")
##   n = 13
cat("  Mean =", round(mean(manual), 2), "mpg\n")
##   Mean = 24.39 mpg
cat("  SD =", round(sd(manual), 2), "\n\n")
##   SD = 6.17
cat("=== INTERPRETATION ===\n")
## === INTERPRETATION ===
cat("Difference in means:", round(mean(manual) - mean(automatic), 2), "mpg\n")
## Difference in means: 7.24 mpg
cat("t-statistic:", round(result_independent$statistic, 3), "\n")
## t-statistic: 3.767
cat("p-value:", format.pval(result_independent$p.value, digits = 4), "\n\n")
## p-value: 0.001374
if(result_independent$p.value < 0.05) {
  cat("Decision: REJECT H₀ (p < 0.05)\n")
  cat("Conclusion: Manual and automatic transmissions have significantly different MPG.\n")
  cat("Manual transmission cars have", 
      ifelse(mean(manual) > mean(automatic), "HIGHER", "LOWER"), 
      "fuel efficiency.\n")
} else {
  cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
  cat("Conclusion: No significant difference in MPG between transmission types.\n")
}
## Decision: REJECT H₀ (p < 0.05)
## Conclusion: Manual and automatic transmissions have significantly different MPG.
## Manual transmission cars have HIGHER fuel efficiency.
# Visualize
boxplot(mpg ~ am, data = mtcars,
        names = c("Automatic", "Manual"),
        main = "Fuel Efficiency by Transmission Type",
        ylab = "Miles per Gallon (MPG)",
        col = c("lightcoral", "lightgreen"),
        border = c("darkred", "darkgreen"))
points(c(1, 2), c(mean(automatic), mean(manual)), 
       pch = 19, col = "blue", cex = 2)
legend("bottomright", legend = "Mean", pch = 19, col = "blue", cex = 1.2)
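By default t.test() runs Welch's version (var.equal = FALSE), which does not assume the two groups have equal variances; the sketch below compares it with the classic pooled-variance Student t-test on the same data.

```r
automatic <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

p_welch  <- t.test(manual, automatic, var.equal = FALSE)$p.value  # Welch (default)
p_pooled <- t.test(manual, automatic, var.equal = TRUE)$p.value   # pooled Student t
c(Welch = p_welch, Pooled = p_pooled)   # both well below 0.05 here
```

When group variances differ (SDs of 3.83 vs 6.17 here), Welch's test is the safer default.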

5.2.4 Example 3: Paired t-test

Research Question: Does a new teaching method improve test scores?

Hypotheses:

  • H₀: μ_difference = 0 (no change in scores)
  • H₁: μ_difference > 0 (scores increased) [one-tailed]

# Paired t-test
# Create example data: before and after scores
set.seed(seeds[3])

n_students <- 25
before_scores <- rnorm(n_students, mean = 70, sd = 10)
# Add improvement effect
improvement <- rnorm(n_students, mean = 5, sd = 8)
after_scores <- before_scores + improvement

student_data <- data.frame(
  Student = 1:n_students,
  Before = round(before_scores, 1),
  After = round(after_scores, 1),
  Difference = round(after_scores - before_scores, 1)
)

cat("=== PAIRED T-TEST (BEFORE-AFTER DESIGN) ===\n\n")
## === PAIRED T-TEST (BEFORE-AFTER DESIGN) ===
cat("Research Question: Did the new teaching method improve test scores?\n\n")
## Research Question: Did the new teaching method improve test scores?
cat("H₀: μ_difference = 0 (no improvement)\n")
## H₀: μ_difference = 0 (no improvement)
cat("H₁: μ_difference > 0 (scores improved) [one-tailed]\n\n")
## H₁: μ_difference > 0 (scores improved) [one-tailed]
# Show first few students
cat("Sample Data (first 6 students):\n")
## Sample Data (first 6 students):
head(student_data) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Student Before After Difference
1 85.0 84.3 -0.7
2 54.8 55.0 0.2
3 67.6 86.9 19.3
4 59.0 64.8 5.8
5 77.2 90.9 13.7
6 72.7 81.5 8.9
# Perform paired t-test (one-tailed)
result_paired <- t.test(student_data$After, student_data$Before, 
                        paired = TRUE, 
                        alternative = "greater")

cat("\n")
print(result_paired)
## 
##  Paired t-test
## 
## data:  student_data$After and student_data$Before
## t = 3.1343, df = 24, p-value = 0.002251
## alternative hypothesis: true mean difference is greater than 0
## 95 percent confidence interval:
##  2.44324     Inf
## sample estimates:
## mean difference 
##            5.38
cat("\n=== DESCRIPTIVE STATISTICS ===\n")
## 
## === DESCRIPTIVE STATISTICS ===
cat("Before: Mean =", round(mean(student_data$Before), 2), 
    ", SD =", round(sd(student_data$Before), 2), "\n")
## Before: Mean = 70.15 , SD = 11.15
cat("After:  Mean =", round(mean(student_data$After), 2), 
    ", SD =", round(sd(student_data$After), 2), "\n")
## After:  Mean = 75.53 , SD = 14.87
cat("Mean Difference:", round(mean(student_data$Difference), 2), "points\n\n")
## Mean Difference: 5.38 points
cat("=== INTERPRETATION ===\n")
## === INTERPRETATION ===
cat("t-statistic:", round(result_paired$statistic, 3), "\n")
## t-statistic: 3.134
cat("p-value (one-tailed):", format.pval(result_paired$p.value, digits = 4), "\n\n")
## p-value (one-tailed): 0.002251
if(result_paired$p.value < 0.05) {
  cat("Decision: REJECT H₀ (p < 0.05)\n")
  cat("Conclusion: The teaching method significantly improved test scores.\n")
} else {
  cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
  cat("Conclusion: No significant evidence of improvement.\n")
}
## Decision: REJECT H₀ (p < 0.05)
## Conclusion: The teaching method significantly improved test scores.
# Visualize
par(mfrow = c(1, 2))

# Box plot comparison
boxplot(student_data$Before, student_data$After,
        names = c("Before", "After"),
        main = "Score Comparison",
        ylab = "Test Score",
        col = c("lightcoral", "lightgreen"))

# Difference histogram
hist(student_data$Difference,
     main = "Score Changes",
     xlab = "Difference (After - Before)",
     col = "lightblue",
     border = "white")
abline(v = 0, col = "red", lwd = 3, lty = 2)
abline(v = mean(student_data$Difference), col = "blue", lwd = 3)
legend("topright", 
       legend = c("No Change", "Mean Change"),
       col = c("red", "blue"), 
       lty = c(2, 1), 
       lwd = 3)

par(mfrow = c(1, 1))
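A paired t-test is mathematically identical to a one-sample t-test on the within-pair differences. The sketch below (with hypothetical before/after scores generated under seed 1) confirms the two calls agree exactly:

```r
set.seed(1)                               # hypothetical data for illustration
before <- rnorm(25, mean = 70, sd = 10)
after  <- before + rnorm(25, mean = 5, sd = 8)

# Same test, two equivalent formulations
t_paired    <- t.test(after, before, paired = TRUE, alternative = "greater")
t_onesample <- t.test(after - before, mu = 0, alternative = "greater")
all.equal(t_paired$p.value, t_onesample$p.value)  # TRUE
```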

5.3 Other Common Tests

5.3.1 Chi-Square Test (Categorical Data)

Research Question: Is there a relationship between transmission type and number of cylinders?

# Chi-square test for independence
cat("=== CHI-SQUARE TEST OF INDEPENDENCE ===\n\n")
## === CHI-SQUARE TEST OF INDEPENDENCE ===
cat("Research Question: Are transmission type and cylinder count independent?\n\n")
## Research Question: Are transmission type and cylinder count independent?
cat("H₀: Transmission and cylinders are independent (no relationship)\n")
## H₀: Transmission and cylinders are independent (no relationship)
cat("H₁: Transmission and cylinders are related\n\n")
## H₁: Transmission and cylinders are related
# Create contingency table
contingency_table <- table(mtcars$am, mtcars$cyl)
rownames(contingency_table) <- c("Automatic", "Manual")
colnames(contingency_table) <- paste(c(4, 6, 8), "cyl")

cat("Contingency Table:\n")
## Contingency Table:
print(contingency_table)
##            
##             4 cyl 6 cyl 8 cyl
##   Automatic     3     4    12
##   Manual        8     3     2
# Perform chi-square test
result_chisq <- chisq.test(contingency_table)
cat("\n")
print(result_chisq)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 8.7407, df = 2, p-value = 0.01265
cat("\n=== INTERPRETATION ===\n")
## 
## === INTERPRETATION ===
cat("Chi-square statistic:", round(result_chisq$statistic, 3), "\n")
## Chi-square statistic: 8.741
cat("p-value:", format.pval(result_chisq$p.value, digits = 4), "\n\n")
## p-value: 0.01265
if(result_chisq$p.value < 0.05) {
  cat("Decision: REJECT H₀ (p < 0.05)\n")
  cat("Conclusion: There is a significant relationship between transmission and cylinders.\n")
} else {
  cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
  cat("Conclusion: No significant relationship detected.\n")
}
## Decision: REJECT H₀ (p < 0.05)
## Conclusion: There is a significant relationship between transmission and cylinders.
# Visualize
barplot(contingency_table, 
        beside = TRUE,
        legend = TRUE,
        main = "Transmission Type by Cylinder Count",
        xlab = "Number of Cylinders",
        ylab = "Frequency",
        col = c("lightcoral", "lightgreen"))
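One caution: the chi-square approximation assumes reasonably large expected counts (a common rule of thumb is all expected counts ≥ 5). Several cells in this table fall short, so it is worth inspecting the expected counts and, if necessary, using Fisher's exact test:

```r
tab <- table(mtcars$am, mtcars$cyl)
# Expected counts under independence; suppressWarnings() hides the
# small-expected-count warning that chisq.test() emits for this table
expected <- suppressWarnings(chisq.test(tab))$expected
round(expected, 2)    # three cells are below 5
fisher.test(tab)      # exact test, no large-sample approximation needed
```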

5.3.2 Correlation Test

Research Question: Is there a linear relationship between car weight and fuel efficiency?

# Correlation test
cat("=== CORRELATION TEST ===\n\n")
## === CORRELATION TEST ===
cat("Research Question: Is there a correlation between weight and MPG?\n\n")
## Research Question: Is there a correlation between weight and MPG?
cat("H₀: ρ = 0 (no correlation)\n")
## H₀: ρ = 0 (no correlation)
cat("H₁: ρ ≠ 0 (correlation exists)\n\n")
## H₁: ρ ≠ 0 (correlation exists)
# Perform correlation test
result_cor <- cor.test(mtcars$wt, mtcars$mpg)

print(result_cor)
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$wt and mtcars$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9338264 -0.7440872
## sample estimates:
##        cor 
## -0.8676594
cat("\n=== INTERPRETATION ===\n")
## 
## === INTERPRETATION ===
cat("Correlation coefficient (r):", round(result_cor$estimate, 3), "\n")
## Correlation coefficient (r): -0.868
cat("t-statistic:", round(result_cor$statistic, 3), "\n")
## t-statistic: -9.559
cat("p-value:", format.pval(result_cor$p.value, digits = 4), "\n\n")
## p-value: 1.294e-10
if(result_cor$p.value < 0.05) {
  cat("Decision: REJECT H₀ (p < 0.05)\n")
  cat("Conclusion: There is a significant", 
      ifelse(result_cor$estimate < 0, "NEGATIVE", "POSITIVE"),
      "correlation.\n")
  cat("Interpretation:", abs(round(result_cor$estimate, 3)), 
      "indicates a",
      ifelse(abs(result_cor$estimate) > 0.7, "strong",
             ifelse(abs(result_cor$estimate) > 0.4, "moderate", "weak")),
      "relationship.\n")
} else {
  cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
  cat("Conclusion: No significant correlation detected.\n")
}
## Decision: REJECT H₀ (p < 0.05)
## Conclusion: There is a significant NEGATIVE correlation.
## Interpretation: 0.868 indicates a strong relationship.
# Visualize
plot(mtcars$wt, mtcars$mpg,
     main = paste("Weight vs MPG (r =", round(result_cor$estimate, 3), ")"),
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per Gallon",
     pch = 19,
     col = "steelblue")
abline(lm(mpg ~ wt, data = mtcars), col = "red", lwd = 2)
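The squared correlation, r², gives the proportion of variance the two variables share, which is often easier to communicate than r itself. A quick computation using the values from the test above:

```r
r <- cor(mtcars$wt, mtcars$mpg)
r_squared <- r^2
round(r_squared, 3)   # ~0.753: weight accounts for about 75% of MPG variance
```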

5.4 Effect Size and Power

5.4.1 Understanding Effect Size

Effect size measures the magnitude of a difference or relationship, independent of sample size.

Field et al. (2012, ch. 11) argue that reporting effect sizes alongside p-values answers “How big is the effect?” and prevents us from celebrating trivial differences that become significant only because of large samples.

Common Effect Size Measures:

  • Cohen’s d: For t-tests (small: 0.2, medium: 0.5, large: 0.8)
  • r: For correlations (small: 0.1, medium: 0.3, large: 0.5)
  • η² (eta-squared): For ANOVA

# Calculate Cohen's d for transmission comparison
# Install effsize if needed
if (!require("effsize", quietly = TRUE)) {
  install.packages("effsize", repos = "https://cran.r-project.org")
  library(effsize)
}

cat("=== EFFECT SIZE ANALYSIS ===\n\n")
## === EFFECT SIZE ANALYSIS ===
# Cohen's d for transmission effect on MPG
cohens_d <- cohen.d(mtcars$mpg ~ factor(mtcars$am))
print(cohens_d)
## 
## Cohen's d
## 
## d estimate: -1.477947 (large)
## 95 percent confidence interval:
##     lower     upper 
## -2.304209 -0.651685
cat("\n=== INTERPRETATION ===\n")
## 
## === INTERPRETATION ===
cat("Cohen's d:", round(cohens_d$estimate, 3), "\n")
## Cohen's d: -1.478
cat("Magnitude:", as.character(cohens_d$magnitude), "\n\n")  # magnitude is a factor; cat() would print its level code
## Magnitude: large
cat("Effect Size Guidelines (Cohen, 1988):\n")
## Effect Size Guidelines (Cohen, 1988):
cat("  Small:  d = 0.2\n")
##   Small:  d = 0.2
cat("  Medium: d = 0.5\n")
##   Medium: d = 0.5
cat("  Large:  d = 0.8\n\n")
##   Large:  d = 0.8
cat("Practical Significance:\n")
## Practical Significance:
if(abs(cohens_d$estimate) < 0.2) {
  cat("The effect is negligible - may not be practically important.\n")
} else if(abs(cohens_d$estimate) < 0.5) {
  cat("The effect is small - some practical importance.\n")
} else if(abs(cohens_d$estimate) < 0.8) {
  cat("The effect is medium - likely practically important.\n")
} else {
  cat("The effect is large - very practically important.\n")
}
## The effect is large - very practically important.
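The packaged estimate can be cross-checked by hand: Cohen's d is the mean difference divided by the pooled standard deviation. A sketch using the standard pooled-variance formula:

```r
automatic <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]
n1 <- length(automatic); n2 <- length(manual)

# Pooled SD: degrees-of-freedom-weighted average of the two group variances
sp <- sqrt(((n1 - 1) * var(automatic) + (n2 - 1) * var(manual)) / (n1 + n2 - 2))
d <- (mean(automatic) - mean(manual)) / sp
round(d, 3)   # -1.478, matching the effsize output above
```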

5.5 Summary of Hypothesis Testing

Common Hypothesis Tests in R
Test Purpose Data_Type R_Function
One-Sample t-test Compare sample mean to known value Continuous t.test(x, mu=value)
Independent t-test Compare means of two independent groups Continuous t.test(x, y)
Paired t-test Compare means of paired/related observations Continuous t.test(x, y, paired=TRUE)
Chi-Square Test Test relationship between categorical variables Categorical chisq.test(table)
Correlation Test Test linear relationship between continuous variables Continuous cor.test(x, y)
ANOVA Compare means of 3+ groups Continuous aov() or anova()

5.5.1 Key Reminders

  1. Always state your hypotheses before looking at the data
  2. Check assumptions before running tests (normality, equal variances, etc.)
  3. Report effect sizes alongside p-values
  4. Consider practical significance not just statistical significance
  5. Be cautious with multiple testing - adjust α if doing many tests
  6. Never say “prove” - we reject or fail to reject H₀
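Reminder 5 in practice: base R's p.adjust() implements several multiple-testing corrections. A sketch with hypothetical p-values from five tests:

```r
p_raw <- c(0.001, 0.012, 0.030, 0.045, 0.200)    # hypothetical p-values
p_bonf <- p.adjust(p_raw, method = "bonferroni") # multiply by number of tests, cap at 1
p_bonf   # 0.005 0.060 0.150 0.225 1.000 - only one remains significant at alpha = 0.05
```

The "holm" and "BH" methods are less conservative alternatives accepted by the same argument.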

Part 6: Practical Applications

6.1 Case Study 1: Air Quality Analysis

# Load and explore air quality data
data(airquality)

cat("=== AIR QUALITY DATASET ===\n")
## === AIR QUALITY DATASET ===
cat("Dimensions:", dim(airquality), "\n")
## Dimensions: 153 6
cat("Variables:", names(airquality), "\n\n")
## Variables: Ozone Solar.R Wind Temp Month Day
# Summary statistics
airquality %>%
  summary()
##      Ozone           Solar.R           Wind             Temp      
##  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
##  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
##  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
##  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
##  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
##  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
##  NA's   :37       NA's   :7                                       
##      Month            Day      
##  Min.   :5.000   Min.   : 1.0  
##  1st Qu.:6.000   1st Qu.: 8.0  
##  Median :7.000   Median :16.0  
##  Mean   :6.993   Mean   :15.8  
##  3rd Qu.:8.000   3rd Qu.:23.0  
##  Max.   :9.000   Max.   :31.0  
## 
# Handle missing values
cat("\nMissing values by variable:\n")
## 
## Missing values by variable:
colSums(is.na(airquality))
##   Ozone Solar.R    Wind    Temp   Month     Day 
##      37       7       0       0       0       0
# Visualize relationships
ggplot(airquality, aes(x = Temp, y = Ozone)) +
  geom_point(aes(color = Month), alpha = 0.6, size = 3) +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Ozone Levels vs Temperature",
    subtitle = "New York Air Quality Measurements (1973)",
    x = "Temperature (°F)",
    y = "Ozone (ppb)"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
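Missing values matter for correlations too: with 37 missing Ozone readings, cor() returns NA unless told how to handle incomplete rows. A minimal sketch:

```r
data(airquality)
cor(airquality$Temp, airquality$Ozone)              # NA: missing values propagate
r_complete <- cor(airquality$Temp, airquality$Ozone,
                  use = "complete.obs")             # drop incomplete pairs
round(r_complete, 3)   # ~0.698: a strong positive relationship
```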

6.3 Case Study 3: Population vs Sample

# Demonstrate population vs sample concepts
# Use seedhash for reproducible sampling
set.seed(seeds[2])  # Use second seed from our generator

# Population: All iris flowers
population_mean <- mean(iris$Sepal.Length)
population_sd <- sd(iris$Sepal.Length)

cat("=== POPULATION STATISTICS ===\n")
## === POPULATION STATISTICS ===
cat("Population Size:", nrow(iris), "\n")
## Population Size: 150
cat("Population Mean:", round(population_mean, 3), "\n")
## Population Mean: 5.843
cat("Population SD:", round(population_sd, 3), "\n\n")
## Population SD: 0.828
# Take multiple samples
sample_sizes <- c(10, 30, 50)
sample_results <- list()

for(n in sample_sizes) {
  sample_means <- replicate(1000, mean(sample(iris$Sepal.Length, n)))
  sample_results[[as.character(n)]] <- data.frame(
    Sample_Size = n,
    Mean_of_Means = mean(sample_means),
    SD_of_Means = sd(sample_means),
    Min = min(sample_means),
    Max = max(sample_means)
  )
}

sample_comparison <- bind_rows(sample_results) %>%
  mutate(across(where(is.numeric) & !Sample_Size, ~round(., 4)))

cat("=== SAMPLING DISTRIBUTION ===\n")
## === SAMPLING DISTRIBUTION ===
cat("Based on 1000 samples for each size\n\n")
## Based on 1000 samples for each size
sample_comparison %>%
  kable(caption = "Effect of Sample Size on Sampling Distribution") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Effect of Sample Size on Sampling Distribution
Sample_Size Mean_of_Means SD_of_Means Min Max
10 5.8449 0.2560 4.9900 6.710
30 5.8446 0.1337 5.4267 6.320
50 5.8371 0.0930 5.5200 6.112
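These simulated SDs match theory: the standard error of the mean is σ/√n, multiplied here by a finite-population correction because each sample is drawn without replacement from a population of only 150 flowers. A sketch of the calculation:

```r
N <- nrow(iris)                 # population size (150)
sigma <- sd(iris$Sepal.Length)  # population SD of sepal length
n <- c(10, 30, 50)

# Standard error with finite-population correction (sampling w/o replacement)
se <- sigma / sqrt(n) * sqrt((N - n) / (N - 1))
round(se, 4)   # close to the simulated SD_of_Means values above
```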
# Visualize sampling distributions
par(mfrow = c(1, 3))
for(n in sample_sizes) {
  sample_means <- replicate(1000, mean(sample(iris$Sepal.Length, n)))
  hist(sample_means,
       main = paste("Sample Size =", n),
       xlab = "Sample Mean",
       col = "lightblue",
       border = "white",
       xlim = c(4.9, 6.8))  # wide enough to show all sample means (n = 10 ranges ~4.99-6.71)
  abline(v = population_mean, col = "red", lwd = 3, lty = 2)
}

par(mfrow = c(1, 1))

Part 7: Summary and Best Practices

7.1 Key Takeaways

1. Data Types Matter

  • Categorical: Nominal, Ordinal, Binary
  • Continuous: Interval, Ratio
  • Choose appropriate statistics and visualizations for each type

2. Descriptive Statistics

Central Tendency:

  • Mean: Average (sensitive to outliers)
  • Median: Middle value (robust to outliers)
  • Mode: Most frequent value

Dispersion:

  • Range: Max - Min (sensitive to extremes)
  • IQR: Q3 - Q1 (robust measure)
  • Standard Deviation: Average deviation from mean
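A standard way to operationalize "robust to outliers" is the 1.5 × IQR rule: values more than 1.5 IQRs outside the quartiles are flagged. A sketch on car weights (mtcars$wt):

```r
x <- mtcars$wt
q <- quantile(x, c(0.25, 0.75))
fence_low  <- q[1] - 1.5 * IQR(x)
fence_high <- q[2] + 1.5 * IQR(x)
outliers <- x[x < fence_low | x > fence_high]
outliers   # the three heaviest cars exceed the upper fence
```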

3. Visualization Principles

  • Always visualize your data before analysis
  • Use appropriate plot types for your data
  • Check for outliers and unusual patterns
  • Verify assumptions (normality, linearity, etc.)

4. Statistical Modeling

  • Start simple, add complexity as needed
  • Check model assumptions and diagnostics
  • Compare model performance metrics
  • Validate predictions on new data

7.2 R Code Cheat Sheet

# ===== DATA EXPLORATION =====
# Load data
data(dataset_name)
head(df)              # First 6 rows
tail(df)              # Last 6 rows
str(df)               # Structure
summary(df)           # Summary statistics
dim(df)               # Dimensions
names(df)             # Variable names

# ===== DESCRIPTIVE STATISTICS =====
mean(x)               # Mean
median(x)             # Median
sd(x)                 # Standard deviation
var(x)                # Variance
range(x)              # Range
IQR(x)                # Interquartile range
quantile(x)           # Quartiles
table(x)              # Frequency table

# ===== VISUALIZATION =====
hist(x)               # Histogram
boxplot(x)            # Boxplot
plot(x, y)            # Scatterplot
barplot(table(x))     # Bar plot
pairs(df)             # Scatterplot matrix

# ===== TIDYVERSE =====
df %>%
  filter(condition) %>%       # Filter rows
  select(var1, var2) %>%      # Select columns
  mutate(new_var = ...) %>%   # Create new variables
  group_by(category) %>%      # Group data
  summarise(mean = mean(x))   # Summarize

# ===== MODELING =====
model <- lm(y ~ x, data = df)     # Linear regression
summary(model)                     # Model summary
predict(model, newdata = new_df)  # Predictions
plot(model)                        # Diagnostic plots

# ===== TESTING =====
t.test(x, y)          # T-test
cor.test(x, y)        # Correlation test
chisq.test(x, y)      # Chi-square test
shapiro.test(x)       # Normality test

7.3 Common Mistakes to Avoid

# ❌ BAD: Not checking for missing values
mean(data$variable)

# ✅ GOOD: Handle missing values
mean(data$variable, na.rm = TRUE)

# ❌ BAD: Assuming normality without checking
t.test(group1, group2)

# ✅ GOOD: Check normality first
shapiro.test(group1)
hist(group1)
# Then decide on appropriate test

# ❌ BAD: Not setting seed for reproducibility
sample(data, 10)

# ✅ GOOD: Set seed for reproducible results
set.seed(123)
sample(data, 10)

# ✅ BEST: Use seedhash for reproducible, named seeds
library(seedhash)
gen <- SeedHashGenerator$new("YourName")
seeds <- gen$generate_seeds(5)
set.seed(seeds[1])
sample(data, 10)

# ❌ BAD: Ignoring assumptions
model <- lm(y ~ x)
# Use model without checking

# ✅ GOOD: Check assumptions
model <- lm(y ~ x)
plot(model)  # Check diagnostic plots

7.4 Additional Resources

7.4.3 Getting Help

# View function documentation
?mean
help(lm)

# Search for functions
??regression

# View examples
example(plot)

# Check package vignettes
browseVignettes("dplyr")

References

  • Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Sage.

Practice Exercises

Exercise 1: Descriptive Statistics

Using the mtcars dataset:

  1. Calculate mean, median, and standard deviation of mpg
  2. Create a frequency table of cyl (number of cylinders)
  3. Calculate the IQR of hp (horsepower)
  4. Identify any outliers in wt (weight)
# Your code here

Exercise 2: Visualization

  1. Create a histogram of mpg with appropriate labels
  2. Make a boxplot comparing mpg across different cyl values
  3. Create a scatterplot of wt vs mpg with a regression line
# Your code here

Exercise 3: Simple Analysis

  1. Test if there’s a relationship between hp and mpg
  2. Build a linear model predicting mpg from wt
  3. Interpret the model coefficients and R-squared value
# Your code here

Session Information

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] effsize_0.8.1    plotly_4.11.0    moments_0.14.1   kableExtra_1.4.0
##  [5] knitr_1.50       lubridate_1.9.4  forcats_1.0.1    stringr_1.5.2   
##  [9] dplyr_1.1.4      purrr_1.1.0      readr_2.1.5      tidyr_1.3.1     
## [13] tibble_3.3.0     ggplot2_4.0.0    tidyverse_2.0.0  seedhash_0.1.0  
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     xml2_1.4.1         lattice_0.22-7    
##  [5] stringi_1.8.7      hms_1.1.4          digest_0.6.37      magrittr_2.0.4    
##  [9] evaluate_1.0.5     grid_4.5.1         timechange_0.3.0   RColorBrewer_1.1-3
## [13] fastmap_1.2.0      Matrix_1.7-4       jsonlite_2.0.0     httr_1.4.7        
## [17] mgcv_1.9-3         crosstalk_1.2.2    viridisLite_0.4.2  scales_1.4.0      
## [21] lazyeval_0.2.2     textshaping_1.0.4  jquerylib_0.1.4    cli_3.6.5         
## [25] rlang_1.1.6        splines_4.5.1      withr_3.0.2        cachem_1.1.0      
## [29] yaml_2.3.10        tools_4.5.1        tzdb_0.5.0         vctrs_0.6.5       
## [33] R6_2.6.1           lifecycle_1.0.4    htmlwidgets_1.6.4  pkgconfig_2.0.3   
## [37] pillar_1.11.1      bslib_0.9.0        gtable_0.3.6       data.table_1.17.8 
## [41] glue_1.8.0         systemfonts_1.3.1  xfun_0.54          tidyselect_1.2.1  
## [45] rstudioapi_0.17.1  farver_2.1.2       nlme_3.1-168       htmltools_0.5.8.1 
## [49] labeling_0.4.3     rmarkdown_2.30     svglite_2.2.2      compiler_4.5.1    
## [53] S7_0.2.0

End of Week 02: R for Data Analytics Tutorial

ANLY 500 - Analytics I

Harrisburg University